(57) Abstract: An embedded information
retrieval system (100) including an embedder
(104), a free-text parser (110), a query engine
(112), a meta search engine (116) and a
feedback retrieval manager (118). When the
system is embedded in a text application, the 
free-text parser (110) takes samples of the text 
supplied by the user and segments the samples 
into sentences. The sentences are ranked
by their content. The top content-bearing
sentences are supplied to the query generator 
to be converted into queries for the query 
dispatcher. For each query, the query dispatcher 
identifies the relevant distributed information 
sources (114), submits the query to them, and
waits for retrievals. The retrievals are passed to 
the retrieval manager (118) and saved locally. 
User feedback is used by the retrieval manager 
(118) persistently and incrementally to improve 
retrieval accuracy.

SYSTEM AND METHOD FOR EMBEDDED DISTRIBUTED
INFORMATION RETRIEVAL IN A 
FREE-TEXT APPLICATION ENVIRONMENT 

MICROFICHE/COPYRIGHT REFERENCE 

A Microfiche Appendix is included in this application (157 frames, 
2 sheets) that contains material which is subject to copyright protection. The 
copyright owner has no objection to the facsimile reproduction by anyone of the 
Microfiche Appendix, as it appears in the Patent and Trademark Office patent files 
or records, but otherwise reserves all copyright rights whatsoever. 

BACKGROUND OF THE INVENTION 

The integration of document processing, query generation and user 
feedback continues to challenge information retrieval (IR) technologies. While 
search portals that are readily accessible on the Internet and corporate intranets 
remain among the most successful information retrieval applications, their ability 
to generate queries and utilize user feedback has some limitations. For example, 
these search portals typically require that users state their information needs in 
explicit queries. While this rigid protocol may benefit information technology 
professionals, lay users have difficulty formulating and satisfying their 
information needs through explicit queries. 

In addition, the query processing mechanism is typically the same for all 
users, and does not allow fast and intuitive customization. Feedback is obtained 
through continuous solicitation of relevance judgments, which disrupts many 
users' information-seeking behaviors and subsequently discourages them from
either using the search portals or providing feedback. Even when provided, 
feedback is commonly utilized in the query space alone. Consequently, the search 
portals' behavior remains the same over multiple interactions. 

While the search portals allow users to perform searches on different topics 
over the Internet, corporate intranets, and private databases, they neither support 
nor integrate with document processing. Thus, to perform a search relevant to the
document at hand, users must disengage from document processing to use a
different application. On the other hand, current text editing and word processing 
applications allow users to create documents about any topic or issue, but lack the 
means to integrate document creation with simultaneous retrieval of relevant 
information. 

Therefore, what is lacking in the art is the integration of document 
processing, query generation and feedback in an application-embedded distributed 
IR system. The implementation of such a system for text processors would make 
IR transparent yet responsive to the needs of common computer users. What is 
needed is a non-intrusive, feedback-sensitive IR system that users can embed into 
their applications to tap into and monitor information sources while still engaged 
in routine usage of those applications. Such applications include text processing, 
spreadsheets and other commonly used software. The need for such a system is
motivated by a growing number of information sources with a wealth of data,
particularly over the Internet, but with few tools for putting the data to use in a
timely and efficient manner.

SUMMARY OF THE INVENTION 

In view of the above, a system and method are presented for
application-embedded information retrieval from distributed free-text information
sources. An application's usage is sampled by an embedded IR system. Samples are
converted into queries to distributed information sources. Retrieval is managed and
adjusted through a user customized interface. The IR system is preferably embedded
in a text processor.

A system for embedded distributed information retrieval includes a module
for embedding a distributed information retrieval system in a computer application
program. A free-text parser is coupled to the application program. The free-text
parser is operative to receive continuous scheduled reads of textual information
from the application program, parse the textual information into sentences, and
rank the sentences by their content-bearing capacities. A query engine is coupled
to receive free-text sentences and generate structured queries in response thereto.
The query engine includes a semantic network processor program, and is coupled
to at least one knowledge base. A metasearch engine is coupled to receive and
submit the structured queries to at least one information source. A retrieval
manager is coupled to the metasearch engine. The retrieval manager receives the
retrieved links associated with the structured queries, and ranks and filters the
retrieved links based upon predefined criteria.

A method for generating structured queries in an embedded distributed 
information retrieval environment includes receiving continuous scheduled reads 
of textual information, and parsing the textual information into sentences. The 
found sentences are ranked by their content-bearing capacities based on their 
terms, i.e., words and phrases. Structured queries are then generated using a 
semantic network processor program. The structured queries are submitted to at 
least one information source. Retrieved links associated with the structured 
queries are received. The retrieved links are ranked and filtered based upon 
predefined criteria. 

The present invention accordingly provides the integration of document 
processing, query generation and feedback in an application-embedded distributed 
IR system. The presently preferred implementation is to embed such a system in a 
text processor application, but other application programs that include textual or 
numeric data can readily take advantage of the benefits of the invention. These 
benefits include a non-intrusive, feedback-sensitive IR system that users can use to 
automatically tap into information sources while still engaged in routine usage of 
the underlying application program. By automatically generating structured 
queries in the background, such a system allows periodic access to the growing 
number of information sources provided over the Internet, as well as on
proprietary and intra-corporate data sources. The frequency of query generation 
and the relevance of retrieved information are controlled by the user to tailor the 
information retrieval process to the user's precise needs and desires. 

These and other features and advantages of the invention will become
apparent upon a review of the following detailed description of the presently
preferred embodiments of the invention, when viewed in conjunction with the
appended drawings.

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a system block diagram showing the embedded information
retrieval system of the invention.

FIG. 2 is a flow chart showing one presently preferred text segmentation
process.

FIG. 3 is a flow chart showing one presently preferred weight assignment
process.

FIG. 4 is a flow chart showing one presently preferred method of automatic
query generation.

FIG. 5 is a flow chart showing one presently preferred metasearch engine.

FIG. 6 is a flow chart showing one presently preferred retrieval manager
process.

DETAILED DESCRIPTION OF THE PRESENTLY PREFERRED 
EMBODIMENTS OF THE INVENTION 

Reference is now made to the drawings, where FIG. 1 is a block diagram of
a system 100 for embedded information retrieval in a distributed free-text
application environment. The system 100 becomes embedded in an application
102 by the embedding module 104. The embedding happens through source
subscription and knowledge base selection. During the source subscription stage,
the user selects distributed sources from which information is to be retrieved. For
example, the user selects public or private search portals. During the knowledge
base selection stage, the user selects a knowledge base 106a-c on a specific area of
expertise. A knowledge base is a semantic network of concepts organized in terms
of abstraction and packaging relations. For example, a financial planner selects a
knowledge base on mutual funds. The information specified by the user during
the source subscription and knowledge base selection is stored in a user
profile 108.

In the preferred embodiment of the invention, the application 102 in which
the system 100 becomes embedded is any application that allows the user to enter 
free text through keyboard 120 or voice, e.g., text editors or text editors coupled 
with speech input and speech recognition devices 122. Input data to the 
system 100 comes from existing free-text documents, or may come simultaneously 
as the documents or files are being created or dictated. A free-text parser 110 
takes samples of incoming free text on a schedule preferably specified by the user. 
The schedule can be, for example, every one or five minutes or as much as hourly, 
daily, weekly, monthly, etc. The user also specifies whether the samples are taken 
from the existing documents, or simultaneously as the documents are being 
created. In an alternate embodiment, input data can be scanned into the system 
100 through a scanner device 124 and processed as described above. 

The free-text samples are segmented by the free-text parser 110 into
sentences through a pattern-matching process based on regular expressions. A 
percentile of the found sentences are selected for query generation, and are passed 
to the query engine 112. The selection of sentences may be done in one of two 
ways. Preferably, sentences are ranked by their content-bearing capacities, and the 
top percentile of the ranked sentences are chosen. The preferred content-bearing 
ranking of the sentences occurs as follows. The sentences are segmented into 
terms, i.e., words and phrases. Each found term is assigned an importance weight 
based on the term's distribution pattern in all of the found sentences. The rank of 
a sentence is computed from the weights of its terms. In the preferred 
embodiment of the invention, the found sentences are ranked by their content- 
bearing capacities, and the top percentile of the ranked sentences are selected for 
query generation. Alternatively, sentences may be selected randomly. A more 
detailed description of the text segmentation and weighting processes is provided 
below in connection with FIGS. 2-3.

The selected sentences are passed to the query engine 112. From each
received sentence, the query engine 112 generates queries for the subscribed
information sources 114a-c by using a semantic network processor program
located in the query engine 112, and the knowledge bases 106a-c specified by the
user. The terms of each sentence are input into the semantic network processor
program. The semantic network processor program spreads activation from the
inputs to the nodes of the knowledge bases 106a-c. The activated nodes invoke
callback procedures associated with them to generate queries from the inputs that
caused their activation. Each callback procedure thus preferably generates
syntactically correct queries for a specific information source. The semantic
network program and its callback procedures therefore translate free-text inputs to
the query languages of the information sources 114a-c selected by the user. A
more detailed description of the process of generating queries is provided below in
connection with FIG. 4. A detailed description of one presently preferred
semantic network program is also provided below.

The queries generated by the query engine 112 are passed to the
metasearch engine 116. The metasearch engine 116 submits each query to the
appropriate information sources 114a-c. Each query is submitted only to those
sources in whose language it is formulated. As those skilled in the art will
appreciate, while FIG. 1 depicts only three information sources 114a-c for the sake
of clarity and simplicity, the number of information sources can be substantially
larger. In the preferred embodiment of the invention, the information sources
114a-c are public search portals, such as AltaVista, Excite, Infoseek, Lycos, or
Yahoo, or private search portals deployed on corporate intranets and local area
networks.

Before dispatching a query to a source, the metasearch engine 116 verifies
that the query is syntactically correct. If the syntax of the query is valid, the
metasearch engine 116 verifies that the query is appropriate for the particular
information source 114a-c. The verification of query appropriateness is based on
two factors: the source descriptions and the user evaluations from the user profile
108. The source descriptions preferably specify the type of information obtainable
from the information sources 114a-c, any timeout intervals, and communication
protocols used by the information sources 114a-c. The timeout interval specifies a
user programmed time interval that elapses from the submission of a query to the
reception of retrievals before the system 100 assumes that the source has not
responded. The timeout intervals are specified by the user through the embedding
module 104. For example, the user can elect to wait for responses from the
source 114a-c for as little as several seconds or as long as several hours, days,
weeks, months, etc.

The communication between the metasearch engine 116 and the
information sources 114a-c is based on distributed networking protocols such as
HTTP, COM, and CORBA. A query is dispatched to an information source 114a-c
if it matches the information source's description, and is consistent with previous
user evaluations. User evaluations of previous retrievals from each source are
kept in the user profile 108, and can be obtained from a retrieval manager 118 on
demand. After the query is dispatched, the metasearch engine 116 waits for an
appropriate timeout. If the timeout elapses before the reception of retrievals, it is
assumed that no retrievals were provided. Otherwise, the retrievals received from
the information source 114a-c are passed to the retrieval manager 118. Each
retrieval preferably specifies the source, the query, and the responses returned after
the query was submitted.

Upon reception of the retrievals, the retrieval manager 118 integrates the
retrievals with the retrievals stored in the user profile 108. The integration
preferably allows all of the retrievals to be viewable by the user on demand. The
user inspects the retrievals at his or her convenience and through the retrieval
manager 118 provides voluntary feedback on the relevance of each retrieval. User
feedback is saved in the user profile 108, and is also provided on demand to the
metasearch engine 116. The retrieval manager 118 preferably partitions user
feedback into two spaces: global and local. The global space contains user
preferences and evaluations that are true of all sources and all free-text inputs
handled by the system 100. The local space contains user preferences and
evaluations that are true of a subset of all information sources and a subset of all
free-text inputs. For example, the user may elect to exclude a particular
information source 114a-c from returning retrievals with respect to a particular
document. A more detailed description of the retrieval management process is
provided below in connection with FIG. 6.
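
The global and local feedback spaces can be pictured with a small data structure. The following Python fragment is only an illustrative sketch: the class `Feedback`, its fields, and its methods are hypothetical names chosen for this example and do not come from the Microfiche Appendix. It assumes that feedback is reduced to source exclusions, with a global set that applies to every document and a local map that applies to one document at a time.

```python
from dataclasses import dataclass, field


@dataclass
class Feedback:
    """Hypothetical container for user feedback (names are illustrative)."""
    # Global space: sources excluded for every free-text input.
    global_excluded: set = field(default_factory=set)
    # Local space: document name -> sources excluded for that document only.
    local_excluded: dict = field(default_factory=dict)

    def exclude(self, source, document=None):
        """Record an exclusion globally, or locally when a document is given."""
        if document is None:
            self.global_excluded.add(source)
        else:
            self.local_excluded.setdefault(document, set()).add(source)

    def allows(self, source, document):
        """True if neither feedback space excludes the source for the document."""
        return (source not in self.global_excluded
                and source not in self.local_excluded.get(document, set()))
```

A metasearch component could consult `allows` before dispatching a query that was generated from a given document; richer feedback, such as graded relevance judgments, would need a different representation.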



WO 01/73610    PCT/US01/09182

In the preferred embodiment of the invention, the application 102,
embedder 104, free-text parser 110, query engine 112, user profile 108,
metasearch engine 116 and retrieval manager 118 either reside on one computer,
or are distributed across a network of computers. The information sources 114a-c
and knowledge bases 106a-c are preferably distributed across a network of
computers.

Referring now to FIG. 2, one presently preferred text segmentation process
is described. Starting with the document 200, a test is performed at step 202 to
determine if additional text is present that needs to be evaluated. If additional text
exists, a new line of text is read from the input document 200 at step 204 and it is
added to a character buffer. A second test is performed at step 206 to determine
whether a known sentence pattern matches the contents of the character buffer. If
a match exists, the matched sentence is added at step 208 to the list of sentences
being compiled for the document 200, and the character buffer is cleared. If no
match exists, the process simply branches back to step 202 to determine if further
text exists in the document 200 that needs to be evaluated. Once analysis of the
document is complete, a list of sentences is returned from the text segmentation
subroutine at step 210.
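
The buffer-and-match loop of FIG. 2 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the listing from the Microfiche Appendix: the regular expression `SENTENCE_PATTERN` and the function `segment` are hypothetical, and the pattern simply treats a run of text ending in a period, exclamation mark, or question mark as a sentence. Instead of clearing the whole buffer, the sketch drops only the matched text, so that several sentences on one line are all found.

```python
import re

# Assumed pattern: the shortest run of text that ends in '.', '!' or '?'
# and is followed by whitespace or the end of the buffer.
SENTENCE_PATTERN = re.compile(r"\s*(.+?[.!?])(?=\s|$)", re.DOTALL)


def segment(lines):
    """Return the sentences found in an iterable of text lines."""
    sentences = []          # list compiled at step 208
    buffer = ""             # character buffer filled at step 204
    for line in lines:      # steps 202 and 204: read while text remains
        buffer += line.rstrip("\n") + " "
        # Step 206: test the buffer against the sentence pattern.
        match = SENTENCE_PATTERN.match(buffer)
        while match:
            # Step 208: store the sentence with its whitespace normalized.
            sentences.append(" ".join(match.group(1).split()))
            buffer = buffer[match.end():]
            match = SENTENCE_PATTERN.match(buffer)
    return sentences        # step 210
```

Text left in the buffer without closing punctuation is not returned. A pattern this simple also splits on abbreviations such as "e.g.", so a practical pattern set would need to be more careful.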

One presently preferred process of assigning weights to the list of sentences 
returned from the text segmentation subroutine is provided in FIG. 3. Referring to 
FIG. 3, the list of retrieved sentences is input to the weight assigning subroutine at 
step 300. At step 302, a table is built of sentence terms from the list of sentences 
received. Next, the weight assignment subroutine determines the weight of each 
term in the table at step 304. 

A term's weight preferably unifies two approaches: inverse document 
frequency (IDF) and condensation clustering (CC). IDF values a term's rarity in 
the set of sentences found in the document 200; CC values terms' non-random 
distribution patterns over the sentences of the document 200. 

The mathematical model is as follows. Let $D$ be the total number of
sentences. Define $f(i, j)$ to be $t_i$'s frequency in the $j$-th sentence $d_j$. Put
$\tilde{n}_{ij} = 1$ if $f(i, j) > 0$, and $0$ otherwise. Put $D_i = \sum_{j=1}^{D} \tilde{n}_{ij}$. For $t_i$'s
IDF-weight, put $T_{idf}(i) = A_{idf} + \log(D/D_i)$, with $A_{idf}$ a constant. For $t_i$'s tfidf
weight in $d_j$, put $T_{tfidf}(i, j) = f(i, j)\,(A_{idf} + \log(D/D_i))$. The CC-weight of $t_i$
compares the actual number of sentences containing at least one occurrence of $t_i$
with the expected number of such sentences: $T_{cc}(i) = A_{cc} + \log(E(\tilde{D}_i)/D_i)$,
where $A_{cc}$ is a constant and $\tilde{D}_i$ is a random variable assuming $D_i$'s values. Put
$T_i = \sum_{j=1}^{D} f(i, j)$. Since $\tilde{n}_{ij}$ assumes 1 and 0 with the respective probabilities
of $p_i$ and $q_i = 1 - p_i$, $E(\tilde{n}_{ij}) = p_i = 1 - (1 - 1/D)^{T_i}$. Since
$\tilde{D}_i = \sum_{j=1}^{D} \tilde{n}_{ij}$, $E(\tilde{D}_i) = D p_i$. For $t_i$'s tfcc weight in $d_j$, put
$T_{tfcc}(i, j) = f(i, j)\, T_{cc}(i)$. Let $A_{cc} = A_{idf}$. By definition,
$T_{cc}(i) = T_{idf}(i) + \log p_i$. Hence, the lemma: If $A_{cc} = A_{idf}$,
$T_{cc}(i) = T_{idf}(i) + \log p_i$.

A class of metrics obtains, unifying IDF and CC: $T_{idfcc}(i) = A + B\, T_{idf}(i) + C \log p_i$,
where $A$, $B$, and $C$ are constants. If $A = A_{idf}$, $B = 1$, and $C = 0$, $T_{idfcc} = T_{idf}$; if
$A = A_{cc}$, $B = 1$, and $C = 1$, $T_{idfcc} = T_{cc}$. Since $f(i, j)$ approximates the importance of
$t_i$ in $d_j$, $t_i$'s weight in $d_j$ is given by $T_{idfcc}(i, j) = f(i, j)\, T_{idfcc}(i)$.

Weights obtained from the weight assignment subroutine are used to 
compute the weight of each sentence, and the sentences are next sorted by their 
weights at step 308. Sentences are then selected by a predetermined threshold or 
random choice at step 310. In the preferred embodiment of the invention, as 
mentioned above, the threshold method is employed using a threshold of 
preferably the top 10%, but the threshold can be preferably modified by the user in 
the user profile 108 (FIG. 1). A list of selected sentences is then returned from the 
weight assigning subroutine at step 312. 
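
The weighting and selection steps of FIG. 3 can be illustrated with a short Python sketch. Several choices here are assumptions made only for the example: sentences are given as lists of terms, the constant inside the IDF weight is taken to be zero, the probability $p_i$ is computed from the total number of occurrences of a term, and a sentence's weight is taken to be the sum of the weights of its term occurrences (the text only says that the rank is computed from the term weights). The function names are hypothetical.

```python
import math
from collections import Counter


def idfcc_weights(sentences, A=0.0, B=1.0, C=1.0):
    """Map each term to A + B*T_idf + C*log(p), with T_idf = log(D/D_i)."""
    D = len(sentences)
    total = Counter()        # total occurrences of each term over all sentences
    containing = Counter()   # D_i: number of sentences containing the term
    for terms in sentences:
        total.update(terms)
        containing.update(set(terms))
    weights = {}
    for term, D_i in containing.items():
        t_idf = math.log(D / D_i)                     # IDF constant assumed 0
        p = 1.0 - (1.0 - 1.0 / D) ** total[term]      # chance a sentence has the term
        weights[term] = A + B * t_idf + C * math.log(p)
    return weights


def select_top(sentences, fraction=0.10):
    """Return the top fraction of sentences, ranked by summed term weights."""
    if not sentences:
        return []
    weights = idfcc_weights(sentences)
    scored = [(sum(weights[t] for t in terms), idx)
              for idx, terms in enumerate(sentences)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties keep document order
    keep = max(1, math.ceil(fraction * len(sentences)))
    return [sentences[idx] for _, idx in scored[:keep]]
```

With `C=0` the weights reduce to plain IDF, and with `C=1` they follow the condensation clustering form, which is the sense in which the single formula covers both approaches. Keeping at least one sentence for short documents is a design choice of this sketch.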

As described above, the returned list of selected sentences is used to
generate the queries that are dispatched to the information sources 114a-c. One
presently preferred embodiment of the process for generating queries is shown in
FIG. 4. The list of selected sentences is received by the query generating
subroutine at step 400, and a test to determine whether the list is empty is
performed at step 402. So long as sentences remain on the list, the next sentence
from the top of the list is selected at step 404, and that sentence is propagated
through a semantic network of concepts at step 406. A detailed description of one
presently preferred semantic network format and associated listing for use with the
invention is provided below.

A list of activated concepts is identified at step 408 and a test is performed
at step 410 to determine if the list is empty or not. If the list is empty, the
subroutine returns back to step 402 to determine if another selected sentence
exists. If the activated concept list is not empty, the first concept is taken off the
list at step 412 and a test is performed at step 414 to determine if the concept has
been identified before. If so, the subroutine moves back to step 410 for another
activated concept. If the concept has not been seen, a query is generated for the
concept, which is added to a list of generated queries at step 416. Afterwards, the
subroutine branches back to step 402 for additional selected sentences. Once the
list of selected sentences 400 is depleted, the query generation subroutine returns a
list of generated queries at step 418.
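
The control flow of FIG. 4 can be sketched as follows. The sketch is a simplification under stated assumptions: spreading activation is replaced by a plain overlap test between a sentence's terms and a set of trigger terms per concept, and the knowledge base layout, the function name `generate_queries`, and the callbacks are all hypothetical.

```python
def generate_queries(sentences, knowledge_base):
    """Generate at most one query per concept from the selected sentences.

    sentences: list of term lists (the selected sentences of step 400).
    knowledge_base: dict mapping a concept name to a pair
        (trigger_terms, callback), where callback(terms) returns a query.
    """
    queries = []
    seen = set()
    for terms in sentences:                          # steps 402 and 404
        present = set(terms)
        # Steps 406 and 408: stand-in for spreading activation.
        activated = [concept
                     for concept, (triggers, _) in knowledge_base.items()
                     if triggers & present]
        for concept in activated:                    # steps 410 and 412
            if concept in seen:                      # step 414
                continue                             # try another concept
            seen.add(concept)
            _, callback = knowledge_base[concept]
            queries.append(callback(terms))          # step 416
            break                                    # back to step 402, as described
    return queries                                   # step 418
```

Each callback stands for the procedure attached to a node that knows the query syntax of one information source; in this sketch the callbacks are ordinary Python functions supplied by the caller.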

In the preferred embodiment of the invention, a metasearch engine 116 
(FIG. 1) is used to take the list of generated queries and dispatch the queries to the 
appropriate information source 114a-c. One presently preferred embodiment of the
metasearch engine is shown in connection with FIG. 5. 

Referring to FIG. 5, the list of generated queries is received by the
metasearch subroutine at step 500. A test is performed initially at step 502 to
determine if additional queries remain on the list. If so, the next query is taken off
the list at step 504 and a list of relevant information sources 114a-c is obtained at
step 506. If the list of relevant information sources 114a-c is empty, as determined
at step 508, the metasearch subroutine branches back to step 502. If not, the query
is submitted to each information source 114a-c on the list at step 512. The
metasearch subroutine then waits at step 514 for the pre-established wait interval
to receive a response from the respective information source 114a-c. If a timeout
occurs and no information was retrieved, as described above, the metasearch
subroutine branches back to step 508. If information was retrieved within the
pre-established time period, the information retrieved is saved in a table and
processed at step 516. The metasearch subroutine then branches back to step 502
for any additional queries. If no queries remain on the list of generated queries, the
metasearch subroutine returns the table of the retrieved information at step 518.
The returned table maps each information source 114a-c to its retrievals.
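
A dispatch loop with a per-source timeout, in the spirit of FIG. 5, might look like the Python sketch below. It is not the appendix listing. It assumes that each information source is wrapped in a Python callable taking a query and returning a list of results, that relevance is decided by a caller-supplied predicate `is_relevant`, and that one timeout value applies to all sources; these names and simplifications are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def metasearch(queries, sources, is_relevant, timeout_s=1.0):
    """Return a table mapping each source name to its (query, results) pairs.

    sources: dict mapping a source name to a callable(query) -> results.
    is_relevant: callable(source_name, query) -> bool.
    """
    table = {name: [] for name in sources}
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        for query in queries:                                   # steps 502, 504
            relevant = [name for name in sources
                        if is_relevant(name, query)]            # steps 506, 508
            futures = {name: pool.submit(sources[name], query)
                       for name in relevant}                    # step 512
            for name, future in futures.items():                # step 514
                try:
                    results = future.result(timeout=timeout_s)
                except FutureTimeout:
                    continue        # source is treated as not having responded
                table[name].append((query, results))            # step 516
    return table                                                # step 518
```

A timed-out call keeps running in its worker thread until it finishes, so the sketch only stops waiting for the answer; it does not cancel the request. Errors raised by a source are not handled here and would propagate to the caller.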

The information retrieved from the queries is processed by the retrieval
manager 118 shown in FIG. 1. One presently preferred process for managing the
retrievals is shown in FIG. 6. Referring to FIG. 6, the table of retrievals is
received at step 600 by the retrieval manager subroutine. A test is initially
performed at step 602 to determine if the table is empty. If not, the next entry is
taken at step 604, and a test is performed at step 606 to see if the particular entry
has been returned by this information source 114a-c before. If so, the subroutine
branches back to step 602. If not, the particular entry in the table of retrievals 600
is entered into a database at step 608. In this manner, the retrieval manager
subroutine processes all of the retrievals in the table 600 until no more retrievals
exist, and the subroutine exits at step 610.
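
The duplicate check of FIG. 6 amounts to a per-source membership test. A minimal Python sketch follows, assuming that the table maps a source name to a list of hashable entries (for example, link strings) and that the database is an in-memory dictionary of sets; both layouts and the function name are hypothetical.

```python
def store_new_retrievals(table, database):
    """Add entries not yet seen from each source; return how many were added.

    table: dict mapping a source name to a list of retrieved entries.
    database: dict mapping a source name to the set of entries already
        stored for that source; it is updated in place.
    """
    added = 0
    for source, entries in table.items():            # steps 602 and 604
        known = database.setdefault(source, set())
        for entry in entries:
            if entry in known:                       # step 606: seen before
                continue
            known.add(entry)                         # step 608
            added += 1
    return added                                     # step 610
```

Because the check is made per source, the same entry returned by two different sources is stored once for each of them, which matches the wording "returned by this information source before".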

The presently preferred computer program listing for implementing the 
above methods and functions is included in the Microfiche Appendix. This 
program is written in the Common Lisp Object System (CLOS) programming
language and the JAVA programming language. As those skilled in the art will 
appreciate, however, the methods and functions described herein can be 
implemented in any number of common computer programming languages 
without departing from the essential spirit and scope of the invention. 

Operation of the preferred embodiment of the invention is best illustrated
with the following example where X is a computer science researcher working on
a grant proposal on intelligent networking protocols. Due to intensive competition
and rapidly approaching deadlines, it is vital that X keep abreast of the most recent
developments in the field. While X knows many relevant information sources, X
cannot take full advantage of them because of their size, dynamic nature, and lack
of adequate search tools. Once the information retrieval system 100 is embedded
in X's word processor, X can employ it to generate automatic queries from
the text of X's grant proposal in the background, submit those queries to relevant
information sources 114a-c, and save the received retrievals locally. Then, X can
inspect the found retrievals when convenient and provide feedback on their quality
and relevance. Since the system 100 operates in the background, the retrieval of
relevant information occurs as a by-product of X's routine document creation.

The above example is easily generalized to other user populations, e.g., 
attorneys, newspaper reporters, technical writers, etc., who need relevant 
information to come to their desktops without disrupting their routine document 
creation activities. Additionally, the system can also be embedded in other 
application programs besides word processors, such as spreadsheet and database 
programs, to name just a few. 

A detailed description of one presently preferred semantic network that can 
be used with the systems and methods described above is provided below. 

Let $\mathbb{R}$ and $\mathbb{N}$ denote reals and naturals, respectively. All subscripts are in
$\mathbb{N}$, unless otherwise specified. If $S$ is a set, $2^S$ denotes its power set, i.e., the set of
all subsets of $S$, and $|S|$ denotes its cardinality. The subset relationship is denoted
by $\subseteq$. The logical if is denoted by $\Rightarrow$; the logical if and only if is denoted by
$\Leftrightarrow$ or iff. If $V$ is a vector space, $\dim(V)$ denotes the dimension of $V$. For example,
if $V$ is a plane, $\dim(V) = 2$.

Elements forming a sequence are written inside a pair of matching square
brackets: $[e_0, \ldots, e_n]$. The empty sequence is written as $[\,]$. Elements forming a set
are written inside curly braces: $\{e_0, \ldots, e_n\}$. The empty set is written as $\{\}$ or
$\emptyset$. Elements forming a vector are written inside angular brackets:
$\langle e_0, \ldots, e_n \rangle$. For example, $[0,1,2]$, $\{0,1,2\}$, $\langle 0,1,2 \rangle$ denote a sequence, a
set, and a vector, respectively. If $v$ is a variable, $\{v\}$, $[v]$, $\vec{v}$, $v$ denote that $v$ is a
set, a sequence, a vector, and an element, respectively. For example,
$\{v\} = \{0,1,2\}$; $[v] = [0,1,2]$; $\vec{v} = \langle 0,1,2 \rangle$; $v = 1$. Furthermore, $\{v_i\}$ denotes a
set of one element $v_i$; $\{v\}_i$ denotes the $i$-th set of elements; $[v_i]$ denotes a sequence
with one element $v_i$; $[v]_i$ denotes the $i$-th sequence of elements. If $S$ is a set, $[S]$ is
the set of all possible sequences over $S$. For example, $[\mathbb{R}]$ is the set of all sequences
of reals.

The functions head and tail return the first element and the rest of the
elements in a sequence respectively, that is, $head([\,]) = [\,]$,
$head([e_0, e_1, \ldots, e_n]) = e_0$, $tail([\,]) = tail([e_0]) = [\,]$, and
$tail([e_0, e_1, \ldots, e_n]) = [e_1, \ldots, e_n]$. The function conc concatenates its first
argument to its second argument. For example,
$conc(v, [e_0, \ldots, e_n]) = [v, e_0, \ldots, e_n]$, $conc([v], [e_0, \ldots, e_n]) = [[v], e_0, \ldots, e_n]$, and
$conc([v], [\,]) = [[v]]$. The function apnd is defined by
$apnd([v], [w]) = [e^v_0, \ldots, e^v_n, e^w_0, \ldots, e^w_m]$, $n, m \geq 0$, where
$[v] = [e^v_0, \ldots, e^v_n]$ and $[w] = [e^w_0, \ldots, e^w_m]$; $apnd([\,], [v]) = [v]$ and
$apnd([v], [\,]) = [v]$. If $[S]_0, [S]_1, \ldots, [S]_n$ are sequences,
$\sum_{i=0}^{n} [S]_i = apnd([S]_0, apnd([S]_1, \ldots, apnd([S]_{n-1}, [S]_n) \ldots))$. A sequence $[S]_1$
completes a sequence $[S]_2$ iff $[S]_2 = [e_0, \ldots, e_n]$ and
$[S]_1 = [e_0] + [v]_0 + [e_1] + [v]_1 + \ldots + [e_n] + [v]_n$, where $[v]_i$, $0 \leq i \leq n$, is a
subsequence of $[S]_1$. For example, if $[S]_1 = [e_0, e_1, e_2, e_3]$, $[S]_2 = [e_0, e_2]$, and
$[S]_3 = [e_2, e_1]$, $[S]_1$ completes $[S]_2$, but does not complete $[S]_3$. Any
sequence completes $[\,]$.
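
The sequence operations above map directly onto list operations. The Python sketch below is illustrative only. It assumes that sequences are Python lists, that the head of the empty sequence is the empty sequence, and that "completes" is read strictly: the first sequence must begin with the first element of the second, and the remaining elements of the second must then occur in the same order, possibly with other elements in between.

```python
def head(seq):
    """First element of a sequence; the empty sequence for [] (assumed)."""
    return seq[0] if seq else []


def tail(seq):
    """All elements after the first; [] for an empty or one-element sequence."""
    return seq[1:]


def conc(item, seq):
    """Put item, whatever it is, in front of seq as a single element."""
    return [item] + seq


def apnd(left, right):
    """Elements of left followed by the elements of right."""
    return left + right


def completes(s1, s2):
    """True if s1 completes s2 under the strict reading described above."""
    if not s2:
        return True                       # any sequence completes []
    if not s1 or s1[0] != s2[0]:
        return False
    rest = iter(s1[1:])
    # Each remaining element of s2 must be found, in order, further along s1.
    return all(any(x == e for x in rest) for e in s2[1:])
```

Note the difference between `conc` and `apnd`: `conc(["v"], [])` yields a sequence whose single element is itself a sequence, while `apnd(["v"], [])` yields `["v"]` unchanged.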

An object is a 2-tuple $[o_i, R_{o_i}]$, where $o_i \in I = \{o_j \mid j \in \mathbb{N}\}$ is the object's
unique id, and $R_{o_i}$ is the object's set of representations. The definition of
representation depends on specific retrieval tasks. For example, objects can be
represented as vectors of reals or as nodes in a semantic network. A retrieval
model $M$ operates in a universe of objects. The universe is the set of all objects,
and is denoted by $\Omega$. $M$'s primitives are called tokens. The definition of token
depends on the context. For example, tokens can be keywords, keyword
collocations, or nodes in a semantic network. The set of all possible tokens is
denoted by $T$. $M$'s representation function is a bijection $\lambda: I \times 2^T \to R$, where $R$
is $M$'s set of representations. The finite set of objects retrievable by $M$ is denoted
by $\Lambda \subset \Omega$. Formally, $\Lambda = \{[o_i, \{r\}] \mid \lambda(o_i, T) = r\}$. Since the second element of
every object in $\Lambda$ is a singleton, i.e., a set of one element, the set notation is
dropped for the sake of simplicity. Thus, $\Lambda = \{[o_i, r] \mid \lambda(o_i, T) = r\}$. While an
object's id is unique in the universe, the object's representation is unique only
within a model. Two different models may represent the same object differently.
However, since the representation function is a bijection, the object's
representation is unique within a model.

Let $\Lambda_I = \{o_i \mid [o_i, r_i] \in \Lambda\}$. Since there is a bijection between $\Lambda$ and $\Lambda_I$,
when the context permits, $\Lambda$ and $\Lambda_I$ are used interchangeably and the objects are
referred to by their ids, i.e., the elements of $\Lambda_I$. The token weight function
$\omega: I \times T \to \mathbb{R}$ assigns weights to tokens in objects. The object similarity function
$\sigma: \Omega \times \Omega \to \mathbb{R}$ computes the similarity between two objects in $\Omega$. The rank
function $\rho: \Omega \times \Omega \to \mathbb{N}$ imposes an ordering on $\Lambda$'s objects. The rank of $o_i \in \Lambda$
with respect to $o_q \in \Omega$ is denoted by $\rho(o_i, o_q) = x \in \mathbb{N}$; then
$(\forall o_k \in \Lambda)\{\{\rho(o_k, o_q) < x\} \Leftrightarrow \{\sigma(o_k, o_q) > \sigma(o_i, o_q)\} \lor \{\sigma(o_k, o_q) = \sigma(o_i, o_q) \land k < i\}\}$,
and
$(\forall o_j \in \Lambda)\{\{\rho(o_j, o_q) > x\} \Leftrightarrow \{\sigma(o_j, o_q) < \sigma(o_i, o_q)\} \lor \{\sigma(o_j, o_q) = \sigma(o_i, o_q) \land i < j\}\}$.
Thus, the ranking of objects is determined by $\sigma$ and their initial ordering in $\Lambda$.
Formally, $M = [\Omega, \Lambda, T, \lambda, \omega, \sigma, \rho]$.

N-ary relations on objects are represented as n-dimensional bit arrays. For 
example, a binary relation is represented as a matrix whose rows and columns are 
objects and whose entries are 0's and 1's, depending on whether the relation holds 
between a given pair of objects. 
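
As a small illustration of the bit-array representation, the following Python sketch builds the matrix for a binary relation. The function name `relation_matrix` and the sample objects are assumptions made for the example only.

```python
def relation_matrix(objects, pairs):
    """Build a 0/1 matrix for a binary relation over `objects`.

    Entry [i][j] is 1 when the relation holds from objects[i] to objects[j].
    """
    index = {obj: i for i, obj in enumerate(objects)}
    matrix = [[0] * len(objects) for _ in objects]
    for a, b in pairs:
        matrix[index[a]][index[b]] = 1
    return matrix


objects = ["sedan", "car", "vehicle"]
isa = relation_matrix(objects, [("sedan", "car"), ("car", "vehicle")])
```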

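A retrieval sequence returned by M in response to $o_q \in \Omega$ is denoted by 
$M(o_q)$, and is a permutation $[o_{\pi(1)}, o_{\pi(2)}, \ldots, o_{\pi(n)}]$ of the ids of 
objects in $\Lambda$ such that 
$\pi(i) < \pi(j) \Leftrightarrow \rho(o_i, o_q) < \rho(o_j, o_q)$. Let 
$M_0 = [\Omega, \Lambda_0, T, \lambda_0, \omega_0, \sigma_0, \rho_0]$ and 
$M_1 = [\Omega, \Lambda_1, T, \lambda_1, \omega_1, \sigma_1, \rho_1]$. $M_0$ and $M_1$ are 
equivalent under ranked retrieval ($M_0 \equiv_{rr} M_1$) iff 
$\Lambda_0 = \{[o_0, \lambda_0(o_0, T)], \ldots, [o_n, \lambda_0(o_n, T)]\}$, 
$\Lambda_1 = \{[o_0, \lambda_1(o_0, T)], \ldots, [o_n, \lambda_1(o_n, T)]\}$, and 
$(\forall o_q \in \Omega)(M_0(o_q) = M_1(o_q))$. Thus, the two models are equivalent only 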
when defined over the same set of tokens. The same methodology is frequently 
used in mathematics when different constructs defined over the same set of 
primitives are shown to be equivalent under specific operations. As a practical 
matter, fixing the set of tokens ensures that comparisons of different models are 
meaningful only when made with respect to one universe over the same inputs. 
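
The notion of equivalence under ranked retrieval can be sketched in Python as follows. This is an illustration under stated assumptions, not the specification's method: each model is reduced to its similarity function, ids are integers, and equivalence is checked only over a finite sample of queries, whereas the definition quantifies over every object in the universe. The function names are hypothetical.

```python
def retrieval_sequence(sigma, ids, query):
    """Order ids by non-increasing similarity, breaking ties by increasing id."""
    return sorted(ids, key=lambda oid: (-sigma(oid, query), oid))


def equivalent_under_ranked_retrieval(sigma0, sigma1, ids, queries):
    """Check that two models return the same sequence for every sampled query."""
    return all(
        retrieval_sequence(sigma0, ids, q) == retrieval_sequence(sigma1, ids, q)
        for q in queries
    )


ids = [0, 1, 2, 3]
queries = [0, 2, 5]


def by_distance(oid, q):
    # Similarity falls linearly with the distance between ids.
    return -abs(oid - q)


def by_inverse_distance(oid, q):
    # A different scale, but the same ordering of objects.
    return 1.0 / (1 + abs(oid - q))


def by_id(oid, q):
    # Ignores the query entirely, so the ordering differs.
    return oid


same = equivalent_under_ranked_retrieval(by_distance, by_inverse_distance, ids, queries)
different = equivalent_under_ranked_retrieval(by_distance, by_id, ids, queries)
```

The first two similarity functions assign different scores but induce the same orderings, which is the sense in which two differently defined models can be equivalent.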

Let $M = [\Omega, \Lambda, T, \lambda, \omega, \sigma, \rho]$ be a semantic network 
retrieval model. The set $\Lambda$ consists of objects each of which is a node in a 
directed graph G with two types of arcs: isa and partof. An isa-arc denotes the 
subclass-superclass relationship between the nodes it connects; a partof-arc denotes 
the part-whole relationship between the nodes. While some semantic networks 
introduce additional relations, isa and partof have become the standard for 
abstraction and packaging. Let $A_0$ be the $|\Lambda| \times |\Lambda|$ matrix such that 
$A_0[i, j] = 1$ if there is an isa-arc from $o_i \in \Lambda$ to $o_j \in \Lambda$, and 
$A_0[i, j] = 0$ if there is no such arc. Let $A_1$ be a similar matrix for the 
partof relationship. An object $o_i$ abstracts an object $o_j$ iff G has a path of 
isa-arcs from $o_j$ to $o_i$. When $o_i$ abstracts $o_j$, $o_i$ is an abstraction of 
$o_j$. An object $o_i$ specializes an object $o_j$ iff G has a path of isa-arcs from 
$o_i$ to $o_j$. Thus, $o_i$ abstracts $o_j$ iff $o_j$ specializes $o_i$. Any object both 
abstracts and specializes itself. 
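
The abstraction relation can be derived from the isa matrix by a reflexive transitive closure. The sketch below uses Warshall's algorithm, which is a choice made for this example; the specification does not name a closure algorithm. The function name `abstraction_closure` and the sample matrix are likewise assumptions.

```python
def abstraction_closure(isa):
    """Return C where C[i][j] == 1 iff object i abstracts object j.

    isa[i][j] == 1 means there is an isa-arc from object i to object j.
    Object i abstracts object j iff a path of isa-arcs leads from j to i,
    so reachability is computed along the arcs and then transposed.
    """
    n = len(isa)
    # reach[i][j] == 1 iff there is a (possibly empty) isa-path from i to j.
    reach = [[1 if i == j else isa[i][j] for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if reach[i][k] and reach[k][j]:
                    reach[i][j] = 1
    return [[reach[j][i] for j in range(n)] for i in range(n)]


# ids: 0 = sedan, 1 = car, 2 = vehicle; sedan isa car, car isa vehicle.
closure = abstraction_closure([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
```

The diagonal of the result is all 1's, reflecting that any object abstracts itself; the specialization relation is the transpose of the same matrix.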

Associated with each node is a single set of labels. A label 
$[x] = [e_0, \ldots, e_n]$ is a sequence of elements such that for all $i$, 
$0 \le i \le n$, $e_i \in T \cup I$. Thus, labels may contain not only tokens but also 
object ids. If $o_i \in \Omega$, then $X_i$ is the set of labels associated with $o_i$. 
If $o_i \in \Lambda$, and $[x]_i = [e_0, \ldots, e_n] \in X_i$, 
$g(o_i, [x]_i) = [\omega(e_0, o_i), \ldots, \omega(e_n, o_i)]$, i.e., 
$g: \Lambda \times [T \cup I] \rightarrow [\mathbb{R}]$. An expectation is a 3-tuple 
$[o_i, [x]_i, [v]_j]$ such that $[x]_i = [u]_k + [v]_j$. For example, if 
$[x]_i = [0, 1, 2]$, then $[o_i, [x]_i, [1, 2]]$, $[o_i, [x]_i, [2]]$, and 
$[o_i, [x]_i, []]$ are expectations. Intuitively, an expectation reflects how 
completed a label is with respect to an object. If $z = [o_i, [x]_i, [v]_j]$, then 
$eobj(z) = o_i$, $eseq(z) = [x]_i$, $ecseq(z) = [v]_j$, and $key(z) = head(ecseq(z))$. 
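
A minimal Python sketch of expectations follows. It enumerates the same three expectations as the example in the text, that is, the proper suffixes of the label including the empty one. Representing an expectation as a named tuple, returning `None` as the key of a completed expectation, and the names `Expectation` and `expectations_for` are assumptions made for this illustration.

```python
from collections import namedtuple

# Field names follow the accessors eobj, eseq and ecseq used in the text.
Expectation = namedtuple("Expectation", ["eobj", "eseq", "ecseq"])


def expectations_for(obj_id, label):
    """List the expectations of one label, from least to most completed."""
    label = tuple(label)
    return [Expectation(obj_id, label, label[k:]) for k in range(1, len(label) + 1)]


def key(expectation):
    """Return the head of the remaining sequence, or None once it is empty."""
    return expectation.ecseq[0] if expectation.ecseq else None


exps = expectations_for(7, [0, 1, 2])
```

The last expectation in the list has an empty remaining sequence, which corresponds to a fully completed label.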

Put $\lambda(o_i, T) = [X_i, L_0, L_1]$, where $X_i$ is the set of labels associated 
with $o_i$, $L_0 = \{o_j \mid o_j \in \Lambda \wedge A_0[i, j] = 1\}$, and 
$L_1 = \{o_j \mid o_j \in \Lambda \wedge A_1[i, j] = 1\}$. Note that $L_0$ and $L_1$ can 
be empty. For example, if $o_q \in \Omega - \Lambda$, 
$\lambda(o_q, T) = [X_q, \{\}, \{\}]$. Let $o_q \in \Omega$ and $o_i \in \Lambda$ and let 
$f: [\mathbb{R}] \times [\mathbb{R}] \rightarrow \mathbb{R}$. The object similarity 
between $o_i$ and $o_q$ is 
$\sigma(o_i, o_q) = \max\{f(g(o_i, [x]_i), g(o_q, [x]_q))\}$, where $[x]_i \in X_i$, 
$[x]_q \in X_q$, and $[x]_q$ completes $[x]_i$. In the maximization, the ranges of 
$i$ and $q$ in $[x]_i$ and $[x]_q$ are $0 \le i < |X_i|$ and $0 \le q < |X_q|$. An 
object $o_q$ activates an object $o_i$ iff there exists a label $[x]_q \in X_q$ and a 
label $[x]_i \in X_i$ such that $[x]_q$ completes $[x]_i$. If there is no 
$[x]_q \in X_q$ such that $[x]_q$ completes $[x]_i$, then $\sigma(o_i, o_q) = 0$. This 
formalization of spreading activation both generalizes and makes rigorous the node 
activation sequence approach. It also subsumes the spreading activation level 
approach and the activation path shape approach. The former is subsumed insomuch 
as the activation level of a node becomes a function of $\sigma$'s values. The latter 
is subsumed insomuch as the node activation paths are determined by object ids in 
labels. An algorithm for retrieving nodes by spreading activation is given below. 

Let M be a semantic network retrieval model. Let $o$ be an input object with 
$X$ as its set of labels. $T$ is a table mapping the ids of objects in $\Lambda$ to the 
scores representing their similarity with the query object, i.e., reals. Initially, 
$T$ maps each id to 0. Let $V$ be the vector representation of $T$, i.e., 
$V = [[o_0, s_0], \ldots, [o_n, s_n]]$, where 
$\{o_i \in \Lambda\} \wedge \{s_i = \sigma(o_i, o)\}$ for all $i$, $0 \le i \le n$. Let $E$ 
be a table mapping expectations to tokens. If $e$ is an expectation, then 
$key(e, E)$ denotes the token to which $E$ maps $e$. The retrieve procedure returns 
$M(o)$. The spread procedure activates nodes with at least one completed sequence. 

    procedure retrieve(o, M, T)
        for each [s] in X
            T = spread(o, [s], T);
        convert T to V;
        sort V's entries by similarity
            in non-increasing order;
        sort V's entries with equal similarity
            by id in increasing order;
        return the sequence of ids as they occur
            in V from left to right;

    procedure spread(o, [s], T)
        w = g([s], o);
        for each e in [s]
            activate(e, T, w);
        return T;

    procedure activate(e, T, w)
        for each abstraction a of e
            for each expectation e keyed on a
                advance(e, T, w);

    procedure advance(x, T, w)
        if null(ecseq(x))
        then
            y = f(w, g(eseq(x), eobj(x)));
            if (T[eobj(x)] < y)
            then T[eobj(x)] = y;
            activate(eobj(x));
        else
            [v] = tail(ecseq(x));
            ne = newexp(eobj(x), eseq(x), [v]);
            key(ne, E) = head([v]);

    procedure newexp(o, [x], [v])
        return a new expectation [o, [x], [v]]
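
The procedures above can be approximated by the following runnable Python sketch. It is one possible reading of the pseudocode under several assumptions that are not stated in the specification: every label starts out as an expectation keyed on its first element; a label counts as completed as soon as its last element has been matched; the function f is taken to be a dot product; token weights default to 1.0; and activation spreads from a completed object only when its score improves, which guarantees termination. The parameter names `query_labels`, `labels`, `isa` and `weight` are hypothetical.

```python
from collections import defaultdict


def retrieve(query_labels, labels, isa, weight=lambda element, obj: 1.0):
    """Rank object ids by spreading activation from the query's labels.

    query_labels: list of token sequences taken from the input object.
    labels: dict mapping an object id to its list of labels (tuples whose
            elements are tokens or object ids).
    isa: dict mapping an element to its direct abstractions.
    """
    scores = {obj: 0.0 for obj in labels}  # plays the role of the table T
    # Plays the role of E: element -> expectations (object, label, remaining).
    table = defaultdict(list)
    for obj, obj_labels in labels.items():
        for label in obj_labels:
            if label:
                table[label[0]].append((obj, tuple(label), tuple(label)))

    def g(label, obj):
        return [weight(element, obj) for element in label]

    def f(w, v):
        # Assumption: a dot product stands in for the unspecified function f.
        return sum(a * b for a, b in zip(w, v))

    def abstractions(element):
        # The element itself plus everything reachable over isa-arcs.
        seen, stack = [], [element]
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.append(current)
                stack.extend(isa.get(current, ()))
        return seen

    def activate(element, w):
        for a in abstractions(element):
            for expectation in list(table[a]):  # snapshot: advance may add entries
                advance(expectation, w)

    def advance(expectation, w):
        obj, label, remaining = expectation
        rest = remaining[1:]
        if not rest:  # the label has been completed
            y = f(w, g(label, obj))
            if scores[obj] < y:
                scores[obj] = y
                activate(obj, w)  # spread only when the score improves
        else:
            table[rest[0]].append((obj, label, rest))

    for s in query_labels:  # corresponds to the spread procedure
        w = g(s, None)
        for element in s:
            activate(element, w)

    # Non-increasing similarity, ties broken by increasing id.
    return sorted(scores, key=lambda obj: (-scores[obj], obj))


network = {
    "car_repair": [("fix", "car")],
    "auto_service": [("car_repair", "shop")],
    "cooking": [("bake", "bread")],
}
arcs = {"sedan": ["car"], "repair": ["fix"]}
result = retrieve([("repair", "sedan", "shop")], network, arcs)
```

In this example the tokens "repair" and "sedan" complete the label of "car_repair" through their abstractions, and the id "car_repair" followed by the token "shop" then completes the label of "auto_service", which shows how object ids inside labels carry activation from node to node.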

As can be seen, the integration of document processing, query generation 
and feedback in an application-embedded distributed IR system provides unique 
advantages over existing systems. Automatic generation of queries from free-text 
documents enables users to retrieve relevant information without disrupting their 
routine document processing activities. Consequently, the retrieval of relevant 
information becomes a by-product of document processing. Customized 
information retrieval and feedback are possible through the incorporation of a user 
profile database 108. Through the feedback feature of the retrieval manager 118, 
the user can control the frequency and content of retrieved information to suit a 
particular document or application. 

The presently preferred embodiment embeds the features and functions of 
the invention in a text processor environment, but other application programs such 
as spreadsheet, database and graphical programs can readily benefit from the 
unique aspects of the invention. These benefits include a non-intrusive, feedback- 
sensitive IR system that integrates document processing with simultaneous 
retrieval of relevant information. Users can use the system to automatically tap 
into information sources while still engaged in routine usage of the underlying 
application program. In an alternate embodiment, input files such as documents 
can be scanned into the system. The unique semantic network processor program 
provides the advantage of automatically generating structured queries from free- 
text documents in a term-independent way, thus allowing the retrieval of 
documents similar in content, but not necessarily similar in the way that content is 
described. 

It is to be understood that a wide range of changes and modifications to the 
embodiments described above will be apparent to those skilled in the art, and are 
contemplated. It is therefore intended that the foregoing detailed description be 
regarded as illustrative, rather than limiting, and that it be understood that it is the 
following claims, including all equivalents, that are intended to define the spirit 
and scope of the invention. 

I CLAIM: 

1. An embedded distributed information retrieval system, comprising: 
an embedding module for embedding a distributed information 
retrieval system in a computer application program; 

a free-text parser coupled to the application program, the free-text 
parser operative to receive continuous scheduled reads of textual information from 
the application program, parse the textual information into sentences, and rank the 
sentences on the basis of words and phrases in the sentences; 

a query engine coupled to receive the ranked sentences, and 
operative to generate structured queries, the query engine coupled to at least one 
knowledge base and including a semantic network processor program; 

a metasearch engine coupled to receive and submit the structured 
queries to at least one information source; and 

a retrieval manager coupled to the metasearch engine, the retrieval 
manager operative to receive retrieved links associated with the structured queries, 
and to rank and filter the retrieved links based upon predefined criteria. 

2. The system defined in claim 1, wherein the predefined criteria 
comprise relevancy to the inputted textual information. 

3. The system defined in claim 1, wherein the at least one knowledge 
base comprises a semantic network of concepts organized in terms of abstraction 
and packaging relations. 

4. The system defined in claim 1, wherein the semantic network 
comprises a knowledge base of concepts connected via hierarchical and packaging 
relations. 

5. The system defined in claim 1, wherein the at least one information 
source comprises a proprietary database. 

6. The system defined in claim 1, wherein the at least one information 
source comprises the Internet. 

7. The system defined in claim 1, wherein the sentences are ranked in 
terms of their content-bearing capacities. 

8. An embedded distributed information retrieval method for 
generating structured queries, comprising the steps of: 

receiving continuous scheduled reads of textual information; 
parsing the textual information into sentences; 
parsing the sentences into words and phrases; 
ranking sentences by their content-bearing capacities based on their 
weighted words and phrases; 

generating structured queries using a semantic network processor 
program; 
submitting the structured queries to at least one information source; 
receiving retrieved links associated with the structured queries; and 
ranking and filtering the retrieved links based upon predefined 
criteria. 

9. The method defined in claim 8, wherein the predefined criteria 
comprise relevancy to the inputted textual information. 

10. The method defined in claim 8, further comprising the step of 
searching the information resources in response to the structured queries. 

11. The method defined in claim 10, wherein the step of searching the 
information resources further comprises the step of searching a proprietary 
database or a search portal. 

12. The method defined in claim 10, wherein the step of searching the 
information resources further comprises the step of searching the Internet. 

13. The method defined in claim 8, further comprising the step of 
providing a metasearch engine to receive and submit the structured queries to the 
at least one information source. 

14. The method defined in claim 13, wherein the semantic network 
program provides its output as input to the metasearch engine. 

15. The method defined in claim 8, wherein the semantic network 
program comprises the step of spreading activation that maps free-text inputs to 
relevant concepts in a knowledge base. 